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ABSTRACT 


Least-squares estimates of regression coefficients 
are extremely sensitive to large errors in even a 
single data point. Frequently, an ad-hoc procedure is 
used to weight the data ina manner to alleviate the 


effects of extreme observations. 


This thesis is a study of the effectiveness of an 
iterative regression method using weights derived 
through maximum-likelihood arguments. Actual weights 
are calculated on the assumption of Cauchy-distributed 
error as a worst-case Situation in which the errors 


have long, fat tails and no finite moments. 
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A.  LEAST-SQUARES LINEAR REGRESSION 


It is often desirable to model the behavior of a 
response variable as a function of another Variable, 
sometimes referred to as a "carrier", since it carries 
information about the dependent variable. In the simplest 


case, the equation 


is fitted to a set of data points (x ,y ). Usually this is 
l i 

done using the "least-squares" procedure which selects the 

coefficients 7 and b. that minimize the sum of squared 


residuals, r,, defined as 
1 


waere the Gare independent and identically distributed 
1 


random variables with mean zero and constant (but unknown) 


variance. Then, by the Gauss-Markov Theorem, the estimates 





b. and b found by solving the "normal equations" 


MI 
PA 
ka 
II 
oO» 
۷۱ 
ps 
+ 
°> 
IV 
PA 
N 


are the best (minimum variance) linear unbiased estimates of 


E" and b”. 
0 1 


2-0 7 0۳ 00 01ت‎ 6 OP LEAST-SQUARES 


the least-squares procedure works very well when the € 
1 


are short-tailed and the other assumptions about the error 


hold. Tf, however, the error distribution has‏ 11-1-5651 1ك 
very long tails, implying that extreme observations may well‏ 
occur, least-squares quickly demonstrates its sensitivity to‏ 
large random error. In real data, the analyst very rarely‏ 
has a hint as to the nature of the true distribution of the‏ 


ANK Ne 51 ۲۹۹۷۶7۰ 060 ۲۶5" ہم م‎ 1389 to the central limit 
z 


theoren are frequently made along the line that there are 


several scurces of variability in the data, which, in the 
aggregate, will be "normally" distributed and thus suitable 
for least-squares. Unfortunately, if any of the errors are 
long-tailed (such as may be described by the Cauchy 
distribution), then their aggregate effect will not be 


normal. 





Figure 10 and Figure 11 in the Appendix are histograns 


of least-squares estimates for b. and b in the linear model 


Ye = LIZ (cme X) 
1 3 


The € for these estimates came from a Normal, or Gaussian, 
i 


distribution, for which the least-squares procedure is 


optimal. Figure 12 and figure 13 show the effects of 
Cauchy-distributed error on tne estimates of these 
coefficients. The end cells contain points which would 
otherwise be off the scale of the histogran, and emphasize 
that large errors in estimating the coefficients are quite 
possible when using least-squares. A uniform distribution's 
adverse effect upon the coefficient estimates is shown in 


Figure 14 and Figure 15. 


A function of Cauchy variates, 


C/100 
= Ce 


is the error density associated with the widely-varying 
coefficient estimates histogrammed in Figure 16 and figure 
007: This distribution is virtually symmetric between +12, 
cut has a long tail extending toward +0 . Another 


distribution of error, a function of normal variates, 


Z = Ne 


has high positive skewness, a little bias, and an adverse 
effect upon the least-squares estimates for b. and b as 
Shown in Figure 18 and Figure 19. 


All of these cases demonstrate that the variances of the 


10 


/ 





coefficient estimates may be drastically increased by the 
presence of non-gaussian, and especially long-tailed, 
distributions of error. While the bulk of the estimates do 
indeed fall near the actual values, there is clearly an 
unacceptable probability of obtaining an extreme estimate 


when using the least-squares procedure. 


Secon OF THE CAUCHY DISTRIBUTION 


Data disturbed by Cauchy-distributed error, with long, 
thick tails and lack of finite moments, may be considered an 
extremely difficult case for regression techniques to treat 
reliably. A procedure that works well for data subjected to 
such extremely straggly-tailed errors can reasonably be 
expected to work well, though not necessarily optimally, in 
many curve-fiting Situations. This thesis uses 
maximum-liklihood estimates for regression coefficients to 
develop a robust regression procedure, then further assumes 
a Cauchy-distributed error to apply a specific technique to 


a series of controlled regression problems. 
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II. SENCDHZERKRIER ROBUST REGRESSION 


ne: u TE Te a Sa امسا اپمت‎ ťa e o s a ee ہک سب سا‎ c F “m لګ‎ 


A.  MAXIMUM-IIKELIHOOD ESTIMATORS 


The procedure to be presented is based upon the linear 


model 


=b + b (x -x) + € 

A 0 1 i l 1 

The ۱ are assumed to be independent, identically 
i 


distributed random variables centered at zero vith spread 


(a scale parameter) and having a density of the form 


iU 


The probability for any single observation y, may be 


38 


expressed as 


12 








The likelihood function for n observations is the product of 
n of the above probilities. Taking logarithms, the 
log-likelihood function is then 


Kh VS b - b (x ex) 
imc -$ln y 2 0 1 i 
o 1 E - n Ino 


A 


Partial derivatives are taxen with respect to Da t and 
1 


md all Set equal to zero to find the b. (D. and 6 


which maximize L. Using r. above, d(x) is defined as 
1 


|f! و‎ 
W(x) = s£ ln f(x) ; 


the three equations obtained from setting the partial 


derivatives of L to zero may be written as 


" 8 5 b- b o 7 
ao s L j= Ó 
é L 
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This system of non-linear equations may be solved 


(j) 


iterativəly. Defining w, as 
i 
(j-1) 
( ._ (J Tr. 1 
1 ~ MEGA 
NS 
where the superscript (j) refers to the number of 


th 
iterations, The equations at the j iteration are 


oe) t . _ 
2 w (yn ےھ ئا رت ہے‎ = O 
1 at 0 1 1 


Z w (x -Z) (Y.- b - b (x -X)) = O 
1 1 1 0 1 i 


Bt SR « 
The first two equations are simply weighted least-squares 
normal equations which may be solved by standard iterative 
weighted least-squares algorithms in which the weights for 
each subsequent iteration are calculated from the above 
expression for لے[‎ 
Assuming the error to be Cauchy-distributed, the weighting 
formula becomes 








E INITIAL ESTIMATES 


It is necessary to begin the iterative process with an 
initial estimate, or guess, of the values of the 
coefficients. A robust estimate suggested by D. F. Andrews 
(1] using the median provides an estimate which is 
insensitive to arbitrarily large disturbances in up to 255 
of the data 

The first coefficient, b (which corresponds to the mean 


O il y rin least-sqüares estimation) is estimated by the 
1 


median of the y : 
1 


Next, the carriers, (X -X), are ordered and then broken up 
1 

into three groups of approximately equal size. Of interest 

ar the 1م12‎ group of carriers, x , the lower carriers, X , 

U L 


and the y. corresponding tO the (x =X) in each group (y and 
1 1 u 


I: 


The estimate for b is a rough slope computed from the 
1 


medians of the four groups: 


bab 
G 
= = 


fa 
I2 
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The median of the absolute values of the residuals from 


these estimates is the initial guess for í: 


الا ا 7 = رلا e - median‏ 


0 
| (1) | 
Weights w, are calculated from the residuals and ac The 
J 
algorithm then proceeds until the values of the coefficients 


stabilize. 


۸ SUMMART OF PHOCEEDLURE 


des Overall Effect 


we تم‎ e e ë e o wb aw ai as صقسست‎ com 


Figure 1 is a typical scatterplot of data which 
includes extreme observations, or "outliers", and sketches 
of representative least-squares and robust fits. The effect 
of the weighting procedure is to pull the extreme 
observations in closer to the bulk of the data, reducing 
their tendency to distort the fit (note least-squares line). 
It should be noted that both the response variable and the 


carriers are weighted in this technique. 
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Least-squares fit 


Robust fit 


alec) > 2115 WITH OUTLIERS 
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There are cases in which the robust procedure may 
not converge to a single global solution. Since the 
solution to the weighted normal equations is actually the 
solution to the three non-linear equations obtained by 
setting the partial derivatives of equal to zero, there 
exists the possibility of converging to a local soluticn not 


Optimizing E, and b. Figure 2 is an example of a local 
solution. The scatterplot represents data which actually 


has two seperate means (the data might be drive-in movie 
attendence, where observations were made only on Wednesdavs 
and Saturdays). A least-squares filit approximately splits 
the two groups of points as indicated. A robust fit may 
also split the data, but could converge to one of the two 
clusters if either is sparse enough to cause the weighting 


process to treat its points as outliers. 
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Robust fit 


Least-squares fit 





Figure 2 - LOCAL SOLUTION 
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32 lgorithm 


The following flow chart depicts the algorithm for 
the Cauchy-weighting regression method. The criteria for 
convergence (change in both coefficients of less than 0.01% 
from one iteration to the next) was somewhat arbitrary, but 
was set to meet practical expectations inanalysis problems 


and not consume excessive amounts of computer time. 
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Tr 
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0.01% 


Figure 3 - ALGORTIHM FLOWCHART 
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۶۰۰۰۰۲۰۰۰۱۷۶ :0۸ي‎ OF R2 


One of the measures of adequacy of fit for least-squares 
regression is R?, the amount of variance explained by the 


regression. It is the ratio 


2 (Y. - y)? 
1 
R2 = 
z (y. - y)? 
1 
where 
A^ » ^ A A _z 
Y. BS 7 کت‎ X) 


For least-squares, R2 is a fraction between 0 and 1, but for 
a robust procedure, the above ratio may exceed 1. This 
occurs when the robust fit is "farther" from the mean of the 


data than the least-squares fit. 
Consider the following set of observations. 


y CIDO 00. 8.00 5 

X 2500137200 340700 6.00 72390 
The nean of the Y. is 7.00 and a least-squares fit of the 
model y=b +bx to the data yields b - 3.385, 


b = 0.094 and R2 = 0.919. A robust fit would reduce the 


effects of observations (2.00,6.00) and (6.00,8.00) since 
they lie somewhat off the line through the other three 
points. A robust procedure, bringing these "extreme! 


observations in closer to the rest of the data, might well 


21 





yield coefficients of 7 = 3.000 and b - 1.000. From these 


coefficients and the data, R2 = 1.124. Figure 4 is a 

scatterplot of the observations with drawings of the actual 

least-squares fit and the postulated robust fit. Note that 

the two fits are very close, but more importantly, that the 

OU e cae Ss rotated so that (y, ~- y)? for the robust fit 
1 

is everywhere greater than or equal to the same measure for 


SM least Squares fit. 
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Robust fit 4 


Least-squares fit 
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More generally, as in Figure 1, R2 as calculated above 
is small due to the large deviations from the mean caused by 
outliers. When a response variable has only a single 
carrier, a plot of the data and the fitted line provide a 
visual evaluation of the fit. In multivariable cases, it is 
usually impossible to plot the data meaningfully, and the 
good fit of the robust line to the bulk of the data could be 


belied by inappropriately using R? as a measure of the fit. 
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III. EXPERIMENTAL PROCEDURE FOR A SINGLE CARRIER 
AR GENERAL DESCRIPTION 
A "true" model 

Vee cet ex) Fe 
i 1 1 

was established to enable comparisons of the Cauchy 

Weighting technigue and least-squares. The x were the 

3 


integers 1 through 20, and random variates were selected 

from one of five controlled error distributions to produce 

20 observations of the y. The y were then regressed on 
1 i 


the (x =X) to obtain estimates for م‎ and b, which could be 
d 


compared to the actual values. 


One thousand replications were made £ or each 
distribution and each methcd. Histograms were constructed 


EOF both b and b to reveal their distributions. 
Preliminary runs indicated that most problems converged 


(both coefficients changed less than 0.01% from one 
iteration to the next) within 10 iterations. To reduce the 
ancunt of time to perform the experiment, the 
Cauchy-weighting iterations were terminated no later than 


the seventh iteration. Values of the coefficients were 


a> 





recorded at the fourth iteration to see if there were 
significant changes between that and the final iteration. 
mBeetbesproblensconverged early, data normally collected at a 


later point were assigned the stabilized values. 
BR ERROR DISTRIBUTIONS 


Five controlled error distributions were used to disturb 
the observations. The first, the Gaussian or "Normal" 
distribution with mean zero was matched to the seccnd, a 


e uchy distribution. This was done by integrating the 


th 
Cuachy density centered at zero to find the 75 quantile, 


giving 1 as a measure of the spread of the distribution. 


th 
Since the corresponding Normal(0,1) 75 quantile- is 0.5705, 
a Normal distribution with standard deviation 1.4826 will 
have the same interquartile range as a Cauchy distribution 
with spread parameter 1. The third source of error Was a 
uniform distribution with mean zero and variance matched to 


the above Normal, giving it a range of -13.1886 to 13.1885. 


The "y" density is a function of Cauchy variates C: 


C/100 
- Ce 


It is positively skewed, but virtually symmetric between -18 
and +18 with a very pronounced central spike. Figure 5 isa 


histogram of 2000 "V" variates. 
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The final test density is a function of Normal variates 


with nean Zero and variance 1. 


N+0.01N2 
Z = Ne 


It is positively skewed and slightly biased. A histogram of 


2000 "Z" variates is shown in Figure 6. 
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A. LEAST-SQUARES ADVANTAGES 


Summary statistics for the distributions of b. and b 
for the single-carrier experiment are shown in Figures 7 and 


8. Looking at means and standard deviations, least-squares 
estimates  (maximum-liklihood estimates for normal-error 
data) are better for normally-distributed cases than the 
Cauchy method. Interestingly, least-squares 15 also 
noticeably better when the error comes from a uniform 
87 ۲۴100 ٭.‎ This result could be explained by the 


relatively broad area in which the data points may fall for 


the uniform error with respect to the range of the  (x -X) 
1 


used, and the susceptibility of the Cauchy methcd to 


convergence to local solutions. 
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Figure 9 is a diagram of the region in which data 
points may fall when the error is uniforn between 
+ 551686 د‎ Since the observations may lie anywhere in the 
region with equal probability, the weighting process may not 
be able to clearly discriminate which points are outliers. 
Chance alignments of a series of points could determine a 
local optimtm upon which the Cauchy-likelihood method would 
converge. While other distributions have longer tails, the 
bulk of their variates fall within a relatively small 
distance of their center, better defining a mean trend and 
Clearly differentiating outliers from the rest of the 


observations. 
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B. CAUCHY METHOD ADVANTAGES 


Comparing means and standard deviations for the two 
methods, the Cauchy-likelihood procedure is clearly the more 
reliable technique when the errors have long tails. Maximum 
and minimum estimates of the coefficients are closer to 
their true values when the Cauchy technique is used, and it 
never produces extreme estimates. There jis little 
difference in the estimates from the fourth to the seventh 


iterations. 
sr SIMILARITIES 


The means and medians for both methods are not 


significantly different. The coefficient estimates between 
th th | 
the. 25 and 75 quantiles are virtually the same over all 


distributions, tne Cauchy-based method doing better for the 
long-tailed distributions and the normal having sone 
advantage principally when the error is uniform. Even 
between the 10 and quantiles, the two procedures 


yield very similar results. 
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V. REGRESSION ON TWO INDEPENDENT CARRIERS 


| u m s s s ان‎ aaa me mm “as aaa aaa maya پوت د‎ aana CÁM “sme a q s s ےد‎ m پښت‎ s Ee 


A. MULTIVARIATE MODEL 


For two independent carriers, the linear model is 


assumed to be of the forn 


=b +b (Xx -X + b (x -xX + € 
٢ 0 , 11 y) 2 | 172 5) 1 


The regression can be expressed in matrix terms as 
Y = YB + E. 


Y is an nx1 matrix of n response variable observations, X is 
3-۰۰۰ matrix having all I's in the first column, the n 
observations of the first carrier in the second column, and 


the n observations of each of the remaining p-2 carriers in 


each of the remaining columns. B is a px! matrix of 

coefficients, D 1D 1D reseb , and Eis an nxn matrix of 
po? 

unknown random errors, independent and identically 


distributed, centered at zero and having constant spread. 


Using a prime (') to designate the transpose of a 
matrix, the least-squares normal equations are 


A 


KIY = KKB ٠. 


The weighted normal equations are 
(NX)'Y - (WX)'XB 
where Wis an nxn matrix having as its diagonal elements w | 


AI 


th (k) 
the k iteration weights, w  , and zeros elsewhere. The 
i 


weighted normal equations are equivalent to a system of p 


2۰٠۔٦۰5‎ 5-۱۲3 م‎ Unknowns (the b , i = 0,1,...,p-1) which can 
1 


be written as 


and is easily solved for B. 
E MODIFICATION TO INITIAL ESTIMATION PRODEDURSE 


Multiple-carrier regression problems require a 
modification to the initial estimate procedure to ensure 
that any interdependence among the carriers is renoved prior 
to estimating the effects of the carriers on the response 
variable. D. F. Andrews [1] has suggested a rather 
time-consumming method applying a robust sweep operator to 
the columns of the X matrix in an iterative process. An 
alternate method inspired by Mosteller and Tukey [6] 
sequentially regresses the carriers on each of their 


predecessors in the X-matrix to eliminate unwanted effects. 
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Multiple regression may be viewed as a sequence of 
single-carrier  regressions in which the dependent variable, 
y, is regressed on the first carrier alone according to the 
model 


y = eX, + residual 


Het. y I ("y adjusted for al be the residual after the 


9 


effects of xe are removed: 


=y- x . 
Ta 1 1 1 
This residual is set aside while the effect of 2 on D is 
removed using the model 
x = d x + ۰ 
2 2 ] 21 
i> : being د‎ adjusted for E š The residual of y 


( Y š ) 15 then regressed on D to find b in the model 


y = b x + residual 
ou 2 2,1 
5 35311111119 [or 1 and ٢ b ال‎ 0 - Rm SO 
thart 
y - bx + bx + residual 
1 1 2 2 


For a model having a mean effect 


° y 


-X ) * b (x. -X ) + residual 
1 N 2 2 


=b +b (x 
e i 
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in practice, b is found immediately, (from the median o£ 


the y, as before) and its effects not removed for 
i 

computational considerations. It is not important to remove 

the mean effect since it is independent of the carrier 


effects that must be removed. 
/ 


Estimates for Y are found using the median estimate 


1 

described above: 

0 1 

Y 00+ 

۸ a a 1 
V, = سب‎ 
1 1 1 
X . WA x . 
3 ül a 11 
where the subscript indicates the quantities have been 
a 
adjusted for all  preceeding carriers. For example, the 
^ 
estimate for 2 would require that y and I both pe 
i i 
adjusted for the effects of carrier > and carrier 25 T À 
en 
Suma lar procedure finds the d. coefficients for ths j 
J; a 
A ۸ 
carrier regressed on its predecessors. The Y ande. may 
1 j;a 
then be arranged in a system of ecuations in the b and 
1 


subsequently solved to yield the coefficients for the 


desired model. 
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G TWO-CARRIER EXPERLMAENT 


The twc-carrier experiment was analagous to the 


one-carrier tests. Coefficients De 7 and D were fixed to 


establish a known model 


oF = 13 + ۰٣۹٦ - 093 E E) : 
The D were the integers 1 through 20 in ascending  crder; 
the E were the same integers shuffled to establish 
independence in the X-matrix columns. The "true" y were 


i 
then calculated and subsequently disturbed by the same 


additive error e as in the one-carrier case. 
i 


Since the single-carrier experiment showed little change 
in the values of the coefficients from the fourth iteration 
to the seventh, the two-carrier iterations were terminated 
after four cycles (or convergence) for each of 1000 
replications for each of the five distributions. Only final 
values were recorded since the initial guesses tended to be 


somewhat unstable in the first experiment. 


DEE ESULETSEOEZTHESTMOZERRRIER EXPERIMENT 


The results of the second experiment are summarized in 
Figures 20, 21 and 22 in the Appendix. The estimates of the 
coefficients parallel the single-carrier cases exactly, with 


the excepticn of the Cauchy method applied to uniform 
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disturbances. The standard deviations of the Cauchy 
estimates for the coefficients of the uniform-disturbed data 
are slightly lower than in the single-carrier case, in 
contrast to the general trend for the standard deviations to 
be higher for the two-carrier problems. While there seens 
to be some interaction between the carriers which raises the 
standard deviations in general, the use of two carriers nay 
be reducing the tendency of the Cauchy method to converge to 


a local solution. 
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The robust method developed and tested in this paper 
demonstrates extremely stable behavior over a variety of 
distributions of random error. “Traditional least-squares 
estimaticn, on the other hand, is subject to potentially 
large errors in its estimates of regression coefficients. 
The method based on 6 161117211 12-48 weighting nas 
consistently smaller error when outliers are present and 
only slightly larger errors (though never any extreme 
errors) when the error distribution is closer to the Normal 


deastrıbution. 


The Cauchy-likelihood estimates appear to be very 


slightly tiased. The centers of the y tend to be estimated 
i 


too high, while the slopes of the carriers are consistently 
low. Possibly, if the experiment were run again with the 
Signs of the error terms reversed, the apparent biasing 
would also reverse to imply that the procedure is robustly 


. 8© 5 212 2 نا 


There are two drawbacks to the Cauchy method. The first 
is its requirement for more calculations and intermediate 
storage. The initial estimates of the coefficients alone 
require more computer assets than least-squares needs for a 
complete, though possibly erroneous, solution. As a general 
rule, the rcbust Cauchy-likelinood procedure requires twice 
the data storage capacity and five to six times as long to 
run as a basic least-squares routine. Clearly, the large 
2 05512902 1 TISK for obtaining seriously inaccurate 


estimates of regression coefficients warrants the use cf the 
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Cauchy, or scme other robust procedure in every-day data 


analysis, even with the increase in computer requirements. 


The other problem with using the Cauchy-likelihood 
method jis the possible convergence upon a local solution. 
It should be noted, however, that traditional least-squares 
will also prcduce erroneous results when used under the same 
conditions which would cause the Cauchy-based technique to 


Stabilize at a local solution. 
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Figure 22 - SUMMARY OF TWO-CARRIER b DISTRIBUTION 
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